Abstract

We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). Each 
database vector is quantized in multiple subspaces via a set of codebooks, learned directly by minimizing the inner 
product quantization error. Then, the inner product of a query to a database vector is approximated as the sum of 
inner products with the subspace quantizers. Different from recently proposed LSH approaches to MIPS, the database 
vectors and queries do not need to be augmented in a higher dimensional feature space. We also provide a theoretical 
analysis of the proposed approach, consisting of the concentration results under mild assumptions. Furthermore, if a 
small sample of example queries is given at the training time, we propose a modified codebook learning procedure 
which further improves the accuracy. Experimental results on a variety of datasets including those arising from deep 
neural networks show that the proposed approach significantly outperforms the existing state-of-the-art. 


1 Introduction 

Many information processing tasks such as retrieval and classification involve computing the inner product of a query 
vector with a set of database vectors, with the goal of returning the database instances having the largest inner products. 
This is often called the Maximum Inner Product Search (MIPS) problem. Formally, given a database $X = \{x_i\}_{i=1 \cdots n}$
and a query vector $q$ drawn from the query distribution $Q$, where $x_i, q \in \mathbb{R}^d$, we want to find $x^* \in X$ such that
$x^* = \operatorname{argmax}_{x \in X} (q^T x)$. This definition can be trivially extended to return the top-$N$ largest inner products.

The MIPS problem is particularly appealing for large scale applications. For example, a recommendation system
needs to retrieve the most relevant items to a user from an inventory of millions of items, whose relevance is commonly
represented as inner products. Similarly, a large scale classification system needs to classify an item into one of the
categories, where the number of categories may be very large. A brute-force computation of inner products via a
linear scan requires $O(nd)$ time and space, which becomes computationally prohibitive when the number of database
vectors and the data dimensionality are large. Therefore it is valuable to consider algorithms that can compress the
database X and compute approximate x* much faster than the brute-force search.
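
As a point of reference for the $O(nd)$ linear scan mentioned above, the following is a minimal sketch of brute-force MIPS. The function name `brute_force_mips`, the parameter `top_n`, and the toy data are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def brute_force_mips(X, q, top_n=1):
    """Return indices of the top_n rows of X with the largest inner product with q."""
    scores = X @ q                 # n inner products, O(n d) time
    order = np.argsort(-scores)    # indices sorted by decreasing score
    return order[:top_n]

# Toy database (assumed for illustration): 4 vectors in 2 dimensions.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0],
              [-1.0, 0.5]])
q = np.array([1.0, 1.0])
top2 = brute_force_mips(X, q, top_n=2)   # scores are [1, 2, 6, -0.5]
```

Every approximate method discussed later can be judged against this exact scan, both for accuracy and for running time.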

The problem of MIPS is related to that of Nearest Neighbor Search with respect to $L_2$ distance ($L_2$NNS) or angular
distance ($\theta$NNS) between a query and a database vector:

$$q^T x = \tfrac{1}{2}\left(\|x\|^2 + \|q\|^2 - \|q - x\|^2\right) = \|q\| \|x\| \cos\theta,$$

or

$$\operatorname{argmax}_{x \in X} (q^T x) = \operatorname{argmax}_{x \in X} \left(\|x\|^2 - \|q - x\|^2\right) = \operatorname{argmax}_{x \in X} \left(\|x\| \cos\theta\right),$$

where $\|\cdot\|$ is the $L_2$ norm. Indeed, if the database vectors are scaled such that $\|x\| = \text{constant}\ \forall x \in X$, the MIPS
problem becomes equivalent to the $L_2$NNS or $\theta$NNS problems, which have been studied extensively in the literature.
However, when the norms of the database vectors vary, as is often true in practice, the MIPS problem becomes quite
challenging. The inner product (distance) does not satisfy the basic axioms of a metric such as the triangle inequality and
coincidence. For instance, it is possible to have $x^T x < x^T y$ for some $y \neq x$. In this paper, we focus on the MIPS
problem where both database and the query vectors can have arbitrary norms.
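
The following small numerical sketch illustrates the two points above: the identity linking the inner product to the $L_2$ distance, and the fact that MIPS and $L_2$NNS can disagree when database norms vary but agree once the norms are equalized. The two database vectors and the query are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: a short vector close to q and a long vector far from q.
X = np.array([[1.0, 0.0],
              [4.0, 3.0]])
q = np.array([1.0, 0.2])

inner = X @ q                            # exact inner products
dist = np.linalg.norm(X - q, axis=1)     # L2 distances to the query

# Identity: q.x = 0.5 * (|x|^2 + |q|^2 - |q - x|^2)
identity_rhs = 0.5 * ((X ** 2).sum(axis=1) + q @ q - dist ** 2)

mips_winner = int(np.argmax(inner))      # the long vector wins MIPS
nns_winner = int(np.argmin(dist))        # the short vector wins L2 NNS

# With unit-norm database vectors the two problems pick the same item.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
mips_winner_unit = int(np.argmax(Xn @ q))
nns_winner_unit = int(np.argmin(np.linalg.norm(Xn - q, axis=1)))
```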

As the main contribution of this paper, we develop a Quantization-based Inner Product (QUIP) search method to
address the MIPS problem. We formulate the problem of quantization as that of codebook learning, which directly
minimizes the quantization error in inner products (Sec. 3). Furthermore, if a small sample of example queries is
provided at the training time, we propose a constrained optimization framework which further improves the
accuracy (Sec. 3.2). We also provide a concentration-based theoretical analysis of the proposed method (Sec. 4). Extensive
experiments on four real-world datasets, involving recommendation (Movielens, Netflix) and deep-learning based
classification (ImageNet and VideoRec) tasks, show that the proposed approach consistently outperforms the state-of-the-art
techniques under both fixed space and fixed time scenarios (Sec. 5).


2 Related works 

The MIPS problem has been studied for more than a decade. For instance, Cohen et al. studied it in the context of
document clustering and presented a method based on randomized sampling without computing the full matrix-vector
multiplication. Other authors described a procedure to modify tree-based search to adapt to the MIPS criterion.
Recently, Bachrach et al. [2] proposed an approach that transforms the input vectors such that the MIPS problem
becomes equivalent to the $L_2$NNS problem in the transformed space, which they solved using a PCA-Tree.

The MIPS problem has received renewed attention with the recent seminal work of Shrivastava and Li,
which introduced an Asymmetric Locality Sensitive Hashing (ALSH) technique with provable search guarantees.
They also transform MIPS into $L_2$NNS, and use the popular LSH technique. Specifically, ALSH applies different
vector transformations to a database vector $x$ and the query $q$, respectively:

$$\hat{x} = [\tilde{x};\ \|\tilde{x}\|^2;\ \|\tilde{x}\|^4;\ \cdots;\ \|\tilde{x}\|^{2^m}], \qquad \hat{q} = [q;\ 1/2;\ 1/2;\ \cdots;\ 1/2],$$

where $\tilde{x} = U_0 \frac{x}{\max_{x \in X} \|x\|}$, $U_0$ is some constant that satisfies $0 < U_0 < 1$, and $m$ is a nonnegative integer. Hence,
$x$ and $q$ are mapped to a new $(d + m)$ dimensional space asymmetrically. Shrivastava and Li showed that when
$m \to \infty$, MIPS in the original space is equivalent to $L_2$NNS in the new space. The proposed hash function followed
the $L_2$ LSH form: $h_i^{L2}(\hat{x}) = \left\lfloor \frac{P_i^T \hat{x} + b_i}{r} \right\rfloor$, where $P_i$ is a $(d + m)$-dimensional vector whose entries are sampled i.i.d.
from the standard Gaussian, $\mathcal{N}(0, 1)$, and $b_i$ is sampled uniformly from $[0, r]$. The same authors later proposed an
improved version of ALSH based on Signed Random Projection (SRP). It transforms each vector using a slightly
different procedure and represents it as a binary code. Then, Hamming distance is used for MIPS:

$$\hat{x} = [\tilde{x};\ \tfrac{1}{2} - \|\tilde{x}\|^2;\ \tfrac{1}{2} - \|\tilde{x}\|^4;\ \cdots;\ \tfrac{1}{2} - \|\tilde{x}\|^{2^m}], \qquad \hat{q} = [q;\ 0;\ 0;\ \cdots;\ 0], \quad \text{and}$$

$$h_i^{SRP}(\hat{x}) = \operatorname{sign}(P_i^T \hat{x}); \qquad Dist^{SRP}(x, q) = \sum_{i=1}^{b} \left[ h_i^{SRP}(\hat{x}) \neq h_i^{SRP}(\hat{q}) \right].$$

Recently, Neyshabur and Srebro argued that a symmetric transformation was sufficient to develop a provable LSH
approach for the MIPS problem if the query was restricted to unit norm. They used a transformation similar to the one
used by Bachrach et al. [2] to augment the original vectors:

$$\hat{x} = [\tilde{x};\ \sqrt{1 - \|\tilde{x}\|^2}], \qquad \hat{q} = [\tilde{q};\ 0],$$

where $\tilde{x} = \frac{x}{\max_{x \in X} \|x\|}$ and $\tilde{q} = \frac{q}{\|q\|}$. They showed that this transformation led to significantly improved results over the
SRP based LSH of Shrivastava and Li. In this paper, we take a quantization based view of the MIPS problem and show that it
leads to even better accuracy under both fixed space or fixed time budget on a variety of real world tasks.
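
To make the augmentation idea concrete, the sketch below applies the symmetric transformation described above (scaling by the maximum database norm and appending $\sqrt{1 - \|\tilde{x}\|^2}$) and checks that the exact nearest neighbor of the augmented query among the augmented database vectors is the MIPS answer. It only illustrates the reduction; the hashing step itself is omitted, and the helper name `symmetric_augment` and the random data are assumptions made for this example.

```python
import numpy as np

def symmetric_augment(X, q):
    """Map database rows and a query into (d+1) dimensions so every x_hat has unit norm."""
    max_norm = np.linalg.norm(X, axis=1).max()
    X_t = X / max_norm                                   # all norms are now <= 1
    tail = np.sqrt(np.maximum(0.0, 1.0 - (X_t ** 2).sum(axis=1)))
    X_hat = np.hstack([X_t, tail[:, None]])              # [x_tilde; sqrt(1 - |x_tilde|^2)]
    q_hat = np.append(q / np.linalg.norm(q), 0.0)        # [q_tilde; 0]
    return X_hat, q_hat

rng = np.random.default_rng(0)
# Database vectors with deliberately varying norms.
X = rng.normal(size=(50, 8)) * rng.uniform(0.5, 3.0, size=(50, 1))
q = rng.normal(size=8)

X_hat, q_hat = symmetric_augment(X, q)
mips_idx = int(np.argmax(X @ q))
nns_idx = int(np.argmin(np.linalg.norm(X_hat - q_hat, axis=1)))
```

Because every augmented database vector has unit norm, $\|\hat{x} - \hat{q}\|^2 = 2 - 2\,\hat{q}^T\hat{x}$, and $\hat{q}^T\hat{x}$ is a positive multiple of $q^T x$, so the two searches agree.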

3 Quantization-based inner product (QUIP) search 

Instead of augmenting the input vectors to a higher dimensional space as in the LSH-based approaches of Sec. 2, we approximate the inner products
by mapping each vector to a set of subspaces, followed by independent quantization of database vectors in each
subspace. In this work, we use a simple procedure for generating the subspaces. Each vector's elements are first
permuted using a random (but fixed) permutation.¹ Then each permuted vector is mapped to $K$ subspaces using
simple chunking, as done in product codes. For ease of notation, in the rest of the paper we will assume that
both query and database vectors have been permuted. Chunking leads to block-decomposition of the query $q \sim Q$ and
each database vector $x \in X$:

$$x = [x^{(1)}; x^{(2)}; \cdots; x^{(K)}], \qquad q = [q^{(1)}; q^{(2)}; \cdots; q^{(K)}],$$

where each $x^{(k)}, q^{(k)} \in \mathbb{R}^l$, $l = \lceil d/K \rceil$.² The $k$-th subspace containing the $k$-th blocks of all the database vectors,
$\{x_i^{(k)}\}_{i=1 \cdots n}$, is then quantized by a codebook $U^{(k)} \in \mathbb{R}^{l \times C_k}$, where $C_k$ is the number of quantizers in subspace $k$.
Without loss of generality, we assume $C_k = C\ \forall k$. Then, each database vector $x$ is quantized in the $k$-th subspace
as $x^{(k)} \approx U^{(k)} \alpha_x^{(k)}$, where $\alpha_x^{(k)}$ is a $C$-dimensional one-hot assignment vector with exactly one 1 and rest 0. Thus, a
database vector $x$ is quantized by a single dictionary element $u_x^{(k)}$ in the $k$-th subspace. Given the quantized database
vectors, the exact inner product is approximated as:

$$q^T x = \sum_k q^{(k)T} x^{(k)} \approx \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)} = \sum_k q^{(k)T} u_x^{(k)} \qquad (1)$$

¹ Another possible choice is random rotation of the vectors which is slightly more expensive than permutation but leads to improved theoretical
guarantees as discussed in the appendix.


Note that this approximation is 'asymmetric' in the sense that only database vectors $x$ are quantized, not the query
vector $q$. One can quantize $q$ as well but it will lead to increased approximation error. In fact, the above asymmetric
computation for all the database vectors can still be carried out very efficiently via look up tables similar to [9], except
that each entry in the $k$-th table is a dot product between $q^{(k)}$ and the columns of $U^{(k)}$.
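
The lookup-table computation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the toy sizes, the variable names, and the use of randomly sampled (not learned) codewords with nearest-codeword Euclidean assignment are all hypothetical choices made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K, C = 1000, 16, 4, 8          # toy sizes (assumed); l = d / K dims per subspace
l = d // K
X = rng.normal(size=(n, d))
q = rng.normal(size=d)

# Stand-in codebooks: C random database blocks per subspace, shape (K, C, l).
codebooks = np.stack([X[rng.choice(n, C, replace=False), k * l:(k + 1) * l]
                      for k in range(K)])

# Database side: codes[i, k] is the index of the codeword assigned to block k of x_i.
codes = np.empty((n, K), dtype=np.int64)
for k in range(K):
    blocks = X[:, k * l:(k + 1) * l]                                   # (n, l)
    d2 = ((blocks[:, None, :] - codebooks[k][None, :, :]) ** 2).sum(axis=2)
    codes[:, k] = d2.argmin(axis=1)

# Query side: K tables with C entries each, tables[k, c] = q^(k) . U_c^(k).
tables = np.stack([codebooks[k] @ q[k * l:(k + 1) * l] for k in range(K)])

# Approximate inner product of q with every database vector: K lookups and adds each.
approx = tables[np.arange(K), codes].sum(axis=1)                       # (n,)

# Check against explicitly reconstructing the quantized vectors.
X_quant = np.hstack([codebooks[k][codes[:, k]] for k in range(K)])
```

Building the tables costs $O(dC)$ per query, after which each database vector needs only $K$ table lookups instead of $d$ multiplications.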

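Before describing the learning procedure for the codebooks $U^{(k)}$ and the assignment vectors $\alpha_x^{(k)}\ \forall x, k$, we first
show an interesting property of the approximation in (1). Let $S_c^{(k)}$ be the $c$-th partition of the database vectors in
subspace $k$ such that $S_c^{(k)} = \{x : \alpha_x^{(k)}[c] = 1\}$, where $\alpha_x^{(k)}[c]$ is the $c$-th element of $\alpha_x^{(k)}$ and $U_c^{(k)}$ is the $c$-th column
of $U^{(k)}$.

Lemma 3.1. If $U_c^{(k)} = \frac{1}{|S_c^{(k)}|} \sum_{x \in S_c^{(k)}} x^{(k)}$, then (1) is an unbiased estimator of $q^T x$.

Proof.

$$E_{q \sim Q,\, x \in X} \Big[ q^T x - \sum_k q^{(k)T} u_x^{(k)} \Big] = \sum_k E_{q \sim Q}\, q^{(k)T}\, E_{x \in X} \big[ x^{(k)} - u_x^{(k)} \big] = \sum_k E_{q \sim Q}\, q^{(k)T}\, E_{x \in X} \Big[ \sum_c I[x \in S_c^{(k)}] \big( x^{(k)} - U_c^{(k)} \big) \Big] = 0,$$

where $I$ is the indicator function, and the last equality holds because for each $k$, $E_{x \in S_c^{(k)}} [x^{(k)} - U_c^{(k)}] = 0$ by
definition. □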


We will provide the concentration inequalities for the estimator in (1) in Sec. 4. Next we describe the learning of
quantization codebooks in different subspaces. We focus on two different training scenarios: when only the database
vectors are given (Sec. 3.1), and when a sample of example queries is also provided (Sec. 3.2). The latter can result in
significant performance gain when queries do not follow the same distribution as the database vectors. Note that the
actual queries used at the test time are different from the example queries, and hence unknown at the training time.


3.1 Learning quantization codebooks from database 

Our goal is to learn data quantizers that minimize the quantization error due to the inner product approximation given
in (1). Assuming each subspace to be independent, the expected squared error can be expressed as:

$$E_{q \sim Q} \sum_{x \in X} \Big[ q^T x - \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)} \Big]^2 = \sum_k \sum_{x \in X} E_{q \sim Q} \Big[ q^{(k)T} \big( x^{(k)} - U^{(k)} \alpha_x^{(k)} \big) \Big]^2 = \sum_k \sum_{x \in X} \big( x^{(k)} - U^{(k)} \alpha_x^{(k)} \big)^T \Sigma_Q^{(k)} \big( x^{(k)} - U^{(k)} \alpha_x^{(k)} \big), \qquad (2)$$

where $\Sigma_Q^{(k)} = E_{q \sim Q}\, q^{(k)} q^{(k)T}$ is the non-centered query covariance matrix in subspace $k$. Minimizing the error in (2)
is equivalent to solving a modified k-Means problem in each subspace independently. Instead of using the Euclidean
distance, the Mahalanobis distance specified by $\Sigma_Q^{(k)}$ is used for assignment. One can use the standard Lloyd's algorithm
to find the solution for each subspace $k$ iteratively by alternating between two steps:

² One can do zero-padding wherever necessary, or use different dimensions in each block.


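$$c_x^{(k)} = \operatorname{argmin}_c \big( x^{(k)} - U_c^{(k)} \big)^T \Sigma_Q^{(k)} \big( x^{(k)} - U_c^{(k)} \big), \quad \alpha_x^{(k)}[c_x^{(k)}] = 1, \quad \forall c, x$$

$$U_c^{(k)} = \frac{\sum_{x \in S_c^{(k)}} x^{(k)}}{|S_c^{(k)}|} \quad \forall c \qquad (3)$$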


Lloyd's algorithm is known to converge to a local minimum (except in pathological cases where it may oscillate
between equivalent solutions). Also, note that the resulting quantizers are always the Euclidean means of their
corresponding partitions, and hence Lemma 3.1 is applicable to (3) as well, leading to an unbiased estimator.
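
A minimal sketch of this modified k-Means for a single subspace is given below, assuming the blocks of that subspace are stacked in an array `B` and a covariance estimate `Sigma` is available. The function name, the random initialization, the fixed iteration count, and the handling of empty partitions (the old codeword is kept) are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def mahalanobis_kmeans(B, Sigma, C, iters=20, seed=0):
    """Lloyd iterations with Mahalanobis assignment and Euclidean-mean update.

    B: (n, l) blocks of one subspace; Sigma: (l, l) non-centered covariance.
    Returns codewords U (C, l), assignments (n,), and the objective per iteration.
    """
    rng = np.random.default_rng(seed)
    U = B[rng.choice(len(B), C, replace=False)].copy()   # init from random blocks
    history = []
    for _ in range(iters):
        diff = B[:, None, :] - U[None, :, :]                        # (n, C, l)
        cost = np.einsum('ncl,lm,ncm->nc', diff, Sigma, diff)       # (x-u)^T Sigma (x-u)
        assign = cost.argmin(axis=1)                                # assignment step
        history.append(cost[np.arange(len(B)), assign].sum())
        for c in range(C):                                          # update step
            members = B[assign == c]
            if len(members):
                U[c] = members.mean(axis=0)
    return U, assign, history

rng = np.random.default_rng(1)
B = rng.normal(size=(500, 4))
Sigma = B.T @ B / len(B)          # database covariance used as a stand-in for the query one
U, assign, history = mahalanobis_kmeans(B, Sigma, C=16)
residual_mean = (B - U[assign]).mean(axis=0)   # zero when codewords are partition means
```

The vanishing residual mean is the condition used in Lemma 3.1: since each codeword is the mean of its partition, the quantization residual averages to zero over the database.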

The above procedure requires the non-centered query covariance matrix $\Sigma_Q$, which will not be known if query
samples are not available at the training time. In that case, one possibility is to assume that the queries come from the
same distribution as the database vectors, i.e., $\Sigma_Q = \Sigma_X$. In the experiments we will show that this version performs
reasonably well. However, if a small set of example queries is available at the training time, besides estimating the
query covariance matrix, we propose to impose novel constraints that lead to improved quantization, as described next.


3.2 Learning quantization codebook from database and example query samples 

In most applications, it is possible to have access to a small set of example queries, $\mathcal{Q}$. Of course, the actual queries
used at the test-time are different from this set. Given these exemplar queries, we propose to modify the learning
criterion by imposing additional constraints while minimizing the expected quantization error. Given a query $q$, since
we are interested in finding the database vector $x_q^*$ with the highest dot-product, ideally we want the dot product of the query
to the quantizer of $x_q^*$ to be larger than the dot product with any other quantizer. Let us denote the matrix containing
the $k$-th subspace assignment vectors $\alpha_x^{(k)}$ for all the database vectors by $A^{(k)}$. Thus, the modified optimization is
given as,

$$\operatorname{argmin}_{U^{(k)}, A^{(k)}}\ E_{q \in \mathcal{Q}} \sum_{x \in X} \Big[ \sum_k q^{(k)T} x^{(k)} - \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)} \Big]^2$$

$$\text{s.t.} \quad \forall q, x: \quad \sum_k q^{(k)T} U^{(k)} \alpha_{x_q^*}^{(k)} \geq \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)}, \quad \text{where } x_q^* = \operatorname{argmax}_x\, q^T x \qquad (4)$$


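We relax the above hard constraints using slack variables to allow for some violations, which leads to the following
relaxed objective:

$$\operatorname{argmin}_{U^{(k)}, A^{(k)}}\ E_{q \in \mathcal{Q}} \sum_{x \in X} \Big[ \sum_k q^{(k)T} \big( x^{(k)} - U^{(k)} \alpha_x^{(k)} \big) \Big]^2 + \lambda \sum_{q \in \mathcal{Q}} \sum_{x \in X} \Big[ \sum_k q^{(k)T} \big( U^{(k)} \alpha_x^{(k)} - U^{(k)} \alpha_{x_q^*}^{(k)} \big) \Big]_+ \qquad (5)$$

where $[z]_+ = \max(z, 0)$ is the standard hinge loss, and $\lambda$ is a nonnegative coefficient. We use an iterative procedure
to solve the above optimization, which alternates between solving $U^{(k)}$ and $A^{(k)}$ for each $k$. In the beginning, each
codebook $U^{(k)}$ is initialized with a set of random database vectors mapped to the $k$-th subspace. Then, we iterate
through the following three steps: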


1. Find a set of violated constraints $W$ with each element as a triplet, i.e., $W_j = \{q_j, x_{q_j}^*, x_j^-\}_{j=1 \cdots J}$, where $q_j \in \mathcal{Q}$
is an exemplar query, $x_{q_j}^*$ is the database vector having the maximum dot product with $q_j$, and $x_j^-$ is a vector such
that $q_j^T x_{q_j}^* \geq q_j^T x_j^-$ but

$$\sum_k q_j^{(k)T} U^{(k)} \alpha_{x_{q_j}^*}^{(k)} < \sum_k q_j^{(k)T} U^{(k)} \alpha_{x_j^-}^{(k)}.$$

2. Fixing $U^{(k)}$ and all columns of $A^{(k)}$ except $\alpha_x^{(k)}$, one can update $\alpha_x^{(k)}\ \forall x, k$ as:

$$c_x^{(k)} = \operatorname{argmin}_c \big( x^{(k)} - U_c^{(k)} \big)^T \Sigma_Q^{(k)} \big( x^{(k)} - U_c^{(k)} \big) + \lambda \sum_j q_j^{(k)T} U_c^{(k)} \big( I[x = x_j^-] - I[x = x_{q_j}^*] \big), \qquad \alpha_x^{(k)}[c_x^{(k)}] = 1$$

Since $C$ is typically small (256 in our experiments), we can find $c_x^{(k)}$ by enumerating all possible values of $c$.

3. Fixing $A$, and all the columns of $U^{(k)}$ except $U_c^{(k)}$, one can update $U_c^{(k)}$ by gradient descent where the gradient can
be computed as:

$$\nabla U_c^{(k)} = 2 \Sigma_Q^{(k)} \sum_{x \in X} \alpha_x^{(k)}[c] \big( U_c^{(k)} - x^{(k)} \big) + \lambda \sum_j q_j^{(k)} \big( \alpha_{x_j^-}^{(k)}[c] - \alpha_{x_{q_j}^*}^{(k)}[c] \big)$$

Note that if no violated constraint is found, step 2 is equivalent to finding the nearest neighbor of $x^{(k)}$ in $U^{(k)}$ in the
Mahalanobis space specified by $\Sigma_Q^{(k)}$. Also, in that case, by setting $\nabla U_c^{(k)} = 0$, the update rule in step 3 becomes

$$U_c^{(k)} = \frac{1}{|S_c^{(k)}|} \sum_{x \in S_c^{(k)}} x^{(k)},$$

which is the stationary point for the first term. Thus, if no constraints are violated,
the above procedure becomes identical to the k-Means-like procedure described in Sec. 3.1. Steps 2 and 3 are
guaranteed not to increase the value of the objective in (5). In practice, we have found that the iterative procedure can
be significantly sped up by modifying step 3 as a perturbation of the stationary point of the first term with a single
gradient step of the second term. The time complexity of step 1 is at most $O(nKC|\mathcal{Q}|)$, but in practice it is much
cheaper because we limit the number of constraints in each iteration to be at most $J$. Step 2 takes $O(nKC)$ and step
3 $O((n + J)KC)$ time. In all the experiments, we use at most $J = 1000$ constraints in each iteration. Also, we fix
$\lambda = 0.01$, step size $\eta_t = 1/(1 + t)$ at each iteration $t$, and the maximum number of iterations $T = 30$.
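
Step 1 of the procedure above can be sketched as follows. This is an illustration only: the function name `find_violations`, the dense score matrices, and the noisy copy of the database used as a stand-in for quantized vectors are assumptions of this example, and the cap `max_constraints` plays the role of $J$.

```python
import numpy as np

def find_violations(X, X_quant, queries, max_constraints=1000):
    """Collect triplets (j, i_star, i_neg): for exemplar query j, the exact MIPS winner
    i_star is out-scored by item i_neg once database vectors are replaced by X_quant."""
    exact = queries @ X.T            # exact inner products, shape (|Q|, n)
    approx = queries @ X_quant.T     # quantized inner products
    triplets = []
    for j in range(len(queries)):
        i_star = int(exact[j].argmax())
        # Items whose quantized score strictly exceeds that of the true winner.
        for i_neg in np.flatnonzero(approx[j] > approx[j, i_star]):
            triplets.append((j, i_star, int(i_neg)))
            if len(triplets) >= max_constraints:
                return triplets
    return triplets

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X_quant = X + 0.3 * rng.normal(size=X.shape)   # hypothetical stand-in for quantization error
queries = rng.normal(size=(20, 8))

violations = find_violations(X, X_quant, queries, max_constraints=50)
no_violations = find_violations(X, X, queries)  # exact vectors cannot violate anything
```

Each returned triplet corresponds to one active hinge term in (5); steps 2 and 3 then only need to visit these terms rather than all query-item pairs.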


4 Theoretical analysis 


In this section we present concentration results about the quality of the quantization-based inner product search 
method. Due to the space constraints, proofs of the theorems are provided in the appendix. We start by defining 
a few quantities. 

Definition 4.1. Given fixed $a, \epsilon > 0$, let $\mathcal{F}(a, \epsilon)$ be an event such that the exact dot product $q^T x$ is at least $a$, but the
quantized version is either smaller than $q^T x (1 - \epsilon)$ or larger than $q^T x (1 + \epsilon)$.

Intuitively, the probability of event $\mathcal{F}(a, \epsilon)$ measures the chance that the difference between the exact and the quantized
dot product is large, when the exact dot product is large. We would like this probability to be small. Next, we introduce
the concept of balancedness for subspaces.

Definition 4.2. Let $v$ be a vector which is chunked into $K$ subspaces: $v^{(1)}, \ldots, v^{(K)}$. We say that chunking is
$\eta$-balanced if the following holds for every $k \in \{1, \ldots, K\}$:

$$\|v^{(k)}\|^2 \leq \left( \frac{1}{K} + (1 - \eta) \right) \|v\|^2$$

Since the input data may not satisfy the balancedness condition, we next show that random permutation tends to 
create more balanced subspaces. Obviously, a (fixed) random permutation applied to vector entries does not change 
the dot product. 

Theorem 4.1. Let $v$ be a vector of dimensionality $d$ and let perm$(v)$ be its version after applying a random permutation
of its dimensions. Then the expected perm$(v)$ is 1-balanced.
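
A quick simulation can illustrate the statement of Theorem 4.1. In the sketch below, all the energy of a made-up vector sits in the first block, which is the worst case for plain chunking; after averaging over random permutations, every block carries about $\|v\|^2 / K$. The sizes and the number of trials are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 64, 8
l = d // K                       # block length
v = np.zeros(d)
v[:l] = 1.0                      # all energy concentrated in the first block

# Without permutation, block 0 holds the entire squared norm.
unpermuted_energy = (v.reshape(K, l) ** 2).sum(axis=1)

# Average squared norm per block over many random permutations.
trials = 20000
block_energy = np.zeros(K)
for _ in range(trials):
    p = rng.permutation(v)
    block_energy += (p.reshape(K, l) ** 2).sum(axis=1)
block_energy /= trials

expected = (v @ v) / K           # the 1-balanced level, ||v||^2 / K
```

Note that this checks balancedness in expectation only; an individual permutation can still be unbalanced, which is why the text also considers random rotations.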


Another choice of creating balancedness is via a (fixed) random rotation, which also does not change the
dot-product. This leads to an even better balancedness property as discussed in the appendix (see Theorem 8.1). Next we
show that the probability of $\mathcal{F}(a, \epsilon)$ can be upper bounded by an exponentially small quantity in $K$, indicating that the
quantized dot products accurately approximate large exact dot products when the quantizers are the means obtained
from Mahalanobis k-Means as described in Sec. 3.1. Note that in this case the quantized dot-product is an unbiased
estimator of the exact dot-product as shown in Lemma 3.1.


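Theorem 4.2. Assume that the dataset $X$ of dimensionality $d$ resides entirely in the ball $B(p, r)$ of radius $r$, centered
at $p$. Further, let $\{x - p : x \in X\}$ be $\eta$-balanced for some $0 < \eta < 1$, where $-$ is applied pointwise, and let
$E\big[\sum_k (q^{(k)T} x^{(k)} - q^{(k)T} u_x^{(k)})\big]_{k=1 \cdots K}$ be a martingale. Denote $q_{max} = \max_{k=1 \cdots K} \max_{q \in Q} \|q^{(k)}\|$. Then, there exist $K$ sets
of codebooks, each with $C$ quantizers, such that the following is true:

$$P(\mathcal{F}(a, \epsilon)) \leq 2 \exp\left( -\left( \frac{a \epsilon}{r} \right)^2 \frac{C^{\frac{2K}{d}}}{8\, q_{max}^2 \left( 1 + (1 - \eta) K \right)} \right)$$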
The above theorem shows that the probability of $\mathcal{F}(a, \epsilon)$ decreases exponentially as the number of subspaces
(i.e., blocks) $K$ increases. This is consistent with the experimental observation that increasing $K$ leads to more accurate
retrieval.

Furthermore, if we assume that each subspace is independent, which is a slightly more restrictive assumption than
the martingale assumption made in Theorem 4.2, we can use the Berry-Esseen inequality to obtain an even stronger
upper bound as given below.

Theorem 4.3. Suppose, A = uYBx.k=i^...^K where A^^^ = max^; | \u^x'^ \ \ is the maximum distance between 

1 

a datapoint and its quantizer in subspace k. Assume A < . Then, 

Qmax 


P(J-(c 




r (k) _ ±_ 

,e)) < ^ 


rJk)-,2 




V27r|A| 


ae 


/i-i - 


where 




and (3 > 0 is some universal constant. 


5 Experimental results 


We conducted experiments with 4 datasets which are summarized below:

Movielens This dataset consists of user ratings collected by the MovieLens site from web users. We use the same
SVD setup as described in the ALSH paper and extract 150 latent dimensions from SVD results. This dataset
contains 10,681 database vectors and 71,567 query vectors.

Netflix The Netflix dataset comes from the Netflix Prize challenge. It contains 100,480,507 ratings that users gave
to Netflix movies. We process it in the same way as suggested in prior work. That leads to 300 dimensional data. There
are 17,770 database vectors and 480,189 query vectors.

ImageNet This dataset comes from the state-of-the-art GoogLeNet image classifier trained on ImageNet.³ The
goal is to speed up the maximum dot-product search in the last, i.e., classification, layer. Thus, the weight vectors
for different categories form the database while the query vectors are the last hidden layer embeddings from the
ImageNet validation set. The data has 1025 dimensions (1024 weights and 1 bias term). There are 1,000 database
and 49,999 query vectors.

VideoRec This dataset consists of embeddings of user interests, trained via a deep neural network to predict a
set of relevant videos for a user. The number of videos in the repository is 500,000. The network is trained with
a multi-label logistic loss. As for the ImageNet dataset, the last hidden layer embedding of the network is used as
query vector, and the classification layer weights are used as database vectors. The goal is to speed up the maximum
dot product search between a query and 500,000 database vectors. Each database vector has 501 dimensions (500
weights and 1 bias term). The query set contains 1,000 vectors.

Following prior work, we focus on retrieving Top-1, 5 and 10 highest inner product neighbors for Movielens and Netflix
experiments. For the ImageNet dataset, we retrieve top-5 categories as is common in the literature. For the VideoRec
dataset, we retrieve Top-50 videos for recommendation to a user. We experiment with three variants of our technique: (1)
QUIP-cov(x): uses only database vectors at training, and replaces $\Sigma_Q$ by $\Sigma_X$ in the k-Means like codebook learning in
Sec. 3.1, (2) QUIP-cov(q): uses $\Sigma_Q$ estimated from a held-out exemplar query set for k-Means like codebook learning,
and (3) QUIP-opt: uses full optimization based quantization (Sec. 3.2). We compare the performance (precision-recall
curves) with 3 state-of-the-art methods: (1) Signed ALSH, (2) L2 ALSH,⁴ and (3) Simple LSH. We
also compare against the PCA-tree version adapted to inner product search as proposed in [2], which has shown better
results than IP-tree. The proposed quantization based methods perform much better than PCA-tree as shown in
the appendix.

We conduct two sets of experiments: (i) fixed bit - the number of bits used by all the techniques is kept the same,
(ii) fixed time - the time taken by all the techniques is fixed to be the same. In the fixed bit experiments, we fix the
number of bits to be $b = 64, 128, 256, 512$. For all the QUIP variants, the codebook size for each subspace, $C$, was
fixed to be 256, leading to an 8-bit representation of a database vector in each subspace. The number of subspaces (i.e.,
blocks) was varied to be $K = 8, 16, 32, 64$, leading to 64, 128, 256, 512 bit representations, respectively. For the fixed
time experiments, we first note that the proposed QUIP variants use table lookup based distance computation while the
LSH based techniques use POPCNT-based Hamming distance computation. Depending on the number of bits used,
we found POPCNT to be 2 to 3 times faster than table lookup. Thus, in the fixed-time experiments, we increase the
number of bits for LSH-based techniques by 3 times to ensure that the time taken by all the methods is the same.

³ The original paper ensembled 7 models and used 144 different crops. In our experiment, we focus on one global crop using one model.
⁴ The recommended parameters $m = 3$, $U_0 = 0.85$, $r = 2.5$ were used in the implementation.

[Figure 1 panels: (a) Movielens dataset; (b) Netflix dataset.]

Figure 1: Precision Recall curves (higher is better) for different methods on Movielens and Netflix datasets, retrieving
Top-1, 5 and 10 items. Baselines: Signed ALSH, L2 ALSH and Simple LSH. Proposed Methods:
QUIP-cov(x), QUIP-cov(q), QUIP-opt. Curves for fixed bit experiments are plotted in solid lines for both the baselines
and proposed methods, where the number of bits used are $b = 64, 128, 256, 512$ respectively, from left to right.
Curves for fixed time experiments are plotted in dashed lines. The fixed time plots are the same as the fixed bit
plots for the proposed methods. For the baseline methods, the number of bits used in fixed time experiments are
$b = 192, 384, 768, 1536$ respectively, so that their running time is comparable with that of the proposed methods.

[Figure 2 panels: (a) ImageNet dataset, retrieval of Top 5 items; (b) VideoRec dataset, retrieval of Top 50 items.]

Figure 2: Precision Recall curves for ImageNet and VideoRec. See appendix for more results.

Figure 1 shows the precision recall curves for Movielens and Netflix, and Figure 2 shows the same for the ImageNet
and VideoRec datasets. All the quantization based approaches outperform LSH based methods significantly when all
the techniques use the same number of bits. Even in the fixed time experiments, the quantization based approaches
remain superior to the LSH-based approaches (shown with dashed curves), even though the former use 3 times fewer
bits than the latter, leading to a significant reduction in memory footprint. Among the quantization methods, QUIP-cov(q)
typically performs better than QUIP-cov(x), but the gap in performance is not that large. In theory, the non-centered
covariance matrix of the queries ($\Sigma_Q$) can be quite different from that of the database ($\Sigma_X$), leading to drastically
different results. However, the comparable performance implies that it is often safe to use $\Sigma_X$ when learning a
codebook. On the other hand, when a small set of example queries is available, QUIP-opt outperforms both
QUIP-cov(x) and QUIP-cov(q) on all four datasets. This is because it learns the codebook with constraints that steer learning
towards retrieving the maximum dot product neighbors in addition to minimizing the quantization error. The overall
training for QUIP-opt was quite fast, requiring 3 to 30 minutes using a single-thread implementation, depending on
the dataset size.

6 Tree-Quantization Hybrids for Large Scale Search 

The quantization based inner product search techniques described above provide a significant speedup over the brute
force search while retaining high accuracy. However, the search complexity is still linear in the number of database
points, similar to that for the binary embedding methods that do an exhaustive scan using Hamming distance. When
the database size is very large, such a linear scan even with fast computation may not be able to provide the required
search efficiency. In this section, we describe a simple procedure to further enhance the speed of QUIPS based on
data partitioning. The basic idea of tree-quantization hybrids is to combine tree-based recursive data partitioning with
QUIPS applied to each partition. At the training time, one first learns a locality-preserving tree such as a hierarchical
k-means tree, followed by applying QUIPS to each partition. In practice only a shallow tree is learned such that each
leaf contains a few thousand points. Of course, a special case of tree-based partitioners is a flat partitioner such as
k-means. At the query time, a query is assigned to more than one partition to deal with the errors caused by hard
partitioning of the data. This soft assignment of a query to multiple partitions is crucial for achieving good accuracy for
high-dimensional data.

In the VideoRec dataset, where $n = 500{,}000$, the quantization approaches (including QUIP-cov(x), QUIP-cov(q),
QUIP-opt) reduce the search time by a factor of 7.17, compared to that of brute force search. The tree-quantization
hybrid approaches (Tree-QUIP-cov(x), Tree-QUIP-cov(q), Tree-QUIP-opt) use 2000 partitions, and each query is
assigned to the nearest 100 partitions based on its dot-product with the partition centers. These Tree-QUIP hybrids
lead to a further speed up of 5.97x over QUIPS, leading to an overall end-to-end speed up of 42.81x over brute force
search. To illustrate the effectiveness of the hybrid approach, we plot the precision recall curves in the Fixed-bit and
Fixed-time experiments on VideoRec in Figure 3. From the Fixed-bit experiments, Tree-Quantization methods have almost
the same accuracy as their non-hybrid counterparts (note that the curves almost overlap in Fig. 3(a) for these two
versions), while resulting in about 6x speed up. From the fixed-time experiments, it is clear that with the same time
budget the hybrid approaches return much better results because they do not scan all the datapoints when searching.

[Figure 3 panels: (a) Fixed-bit experiment; (b) Fixed-time experiment.]

Figure 3: Precision recall curves on VideoRec dataset, retrieving Top-50 items, comparing quantization based methods
and tree-quantization hybrid methods. In (a), we conduct a fixed bit comparison where both the non-hybrid methods and
hybrid methods use the same 512 bits. The non-hybrid methods are considerably slower in this case (5.97x). In (b),
we conduct a fixed time experiment, where the time of retrieval is fixed to be the same as taken by the hybrid methods
(2.256ms). The non-hybrid approaches give much lower accuracy in this case.
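
The partition-then-scan idea can be sketched with a flat k-means partitioner, the special case mentioned above. This is a simplified illustration under several assumptions: the function names, sizes, and random data are made up, and the candidates inside the probed partitions are scored with exact inner products where a real hybrid would use the quantized lookup-table scores.

```python
import numpy as np

def build_partitions(X, num_parts, iters=10, seed=0):
    """Flat k-means partitioner (a stand-in for a shallow locality-preserving tree)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), num_parts, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        part = d2.argmin(axis=1)                       # hard partition of the database
        for p in range(num_parts):
            if np.any(part == p):
                centers[p] = X[part == p].mean(axis=0)
    return centers, part

def hybrid_search(X, centers, part, q, probes, top_n=1):
    """Scan only the `probes` partitions whose centers have the largest dot product with q."""
    chosen = np.argsort(-(centers @ q))[:probes]       # soft assignment of the query
    cand = np.flatnonzero(np.isin(part, chosen))       # database points to be scanned
    scores = X[cand] @ q                               # exact here; QUIP would use tables
    return cand[np.argsort(-scores)[:top_n]], len(cand)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
q = rng.normal(size=16)
centers, part = build_partitions(X, num_parts=20)

full_result, scanned_all = hybrid_search(X, centers, part, q, probes=20)
few_result, scanned_few = hybrid_search(X, centers, part, q, probes=5)
```

Probing every partition reproduces the exhaustive scan, while probing fewer partitions trades some recall for a proportionally smaller candidate set, which is the source of the additional speed up reported above.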


7 Conclusion 

We have described a quantization based approach for fast approximate inner product search, which relies on robust 
learning of codebooks in multiple subspaces. One of the proposed variants leads to a very simple kmeans-like learning 
procedure and yet outperforms the existing state-of-the-art by a significant margin. We have also introduced novel 
constraints in the quantization error minimization framework that lead to even better codebooks, tuned to the problem 
of highest dot-product search. Extensive experiments on retrieval and classification tasks show the advantage of the 
proposed method over the existing techniques. In the future, we would like to analyze the theoretical guarantees 
associated with the constrained optimization procedure. In addition, in the tree-quantization hybrid approach, the tree 
partitioning and the quantization codebooks are trained separately. As future work, we will consider training them
jointly.

8 Appendix 

8.1 Additional Experimental Results 

The results on ImageNet and VideoRec datasets for different numbers of top neighbors and different numbers of bits are
shown in Figure 4. In addition, we compare the performance of our approach against PCA-Tree. The recall curves
with respect to different numbers of returned neighbors are shown in Figure 5.

8.2 Theoretical analysis - proofs 

In this section we present proofs of all the theorems presented in the main body of the paper. We also show some 
additional theoretical results on our quantization based method. 
[Figure 4 panels: (a) ImageNet dataset, retrieval of Top-1, 5 and 10 items; (b) VideoRec dataset, retrieval of Top-10, 50 and 100 items.]

Figure 4: Precision Recall curves using different methods on ImageNet and VideoRec.

[Figure 5 panels: (a) Movielens, top-10; (b) Netflix, top-10; (c) VideoRec, top-50; (d) ImageNet, top-5. Horizontal axis: Percentage of Points Returned.
Legend entries recovered: L2 ALSH-FixedBit, Simple LSH-FixedBit, L2 ALSH-FixedTime, Signed ALSH-FixedTime, Simple LSH-FixedTime,
QUIP-cov(q), QUIP-cov(x), QUIP-opt.]

Figure 5: Recall curves for different techniques under different numbers of returned neighbors (shown as the
percentage of total number of points in the database). We plot the recall curve instead of the precision recall curve because
PCA-Tree uses original vectors to compute distances; therefore the precision will be the same as recall in Top-K search.
The number of bits used for all the plots is 512, except for Signed ALSH-FixedTime, L2 ALSH-FixedTime and Simple
LSH-FixedTime, which use 1536 bits. PCA-Tree does not perform well on these datasets, mostly due to the fact that the
dimensionality of our datasets is relatively high (150 to 1025 dimensions), and trees are known to be more susceptible
to dimensionality. Note that the original paper from Bachrach et al. [2] used datasets with dimensionality 50.

Figure 6: Upper bound on the probability of an event $\mathcal{A}(\eta)$ that a vector $v$ obtained by the random rotation is not
$\eta$-balanced as a function of the number of subspaces $K$. The left figure corresponds to $\eta = 0.75$ and the right one to
$\eta = 0.5$. Different curves correspond to different data dimensionality ($d = 128, 256, 512, 1024$).

8.2.1 Vectors’ balancedness - proof of Theorem 4.1 

In this section we prove Theorem 4.1 and show that one can also obtain the balancedness property with the use of a random rotation. 

Proof. Let us denote $v = (v_1, \ldots, v_d)$ and $\mathrm{perm}(v) = [B_1, \ldots, B_K]$, where $B_i$ is the $i$th block ($i = 1, \ldots, K$). Let us fix some block $B_j$. For a given $i$ denote by $X_i^j$ a random variable such that $X_i^j = v_i^2$ if $v_i$ is in the block $B_j$ after applying the random permutation and $X_i^j = 0$ otherwise. Notice that the random variable $N^j = \sum_{i=1}^{d} X_i^j$ captures the part of the squared norm of the vector $v$ that resides in block $j$. We have: 

$$E[N^j] = \sum_{i=1}^{d} E[X_i^j] = \sum_{i=1}^{d} \frac{1}{K} v_i^2 = \frac{1}{K}\|v\|_2^2 \tag{6}$$

Since the analysis presented above can be conducted for every block $B_j$, we complete the proof. 

□ 
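
The expectation computed in (6) can be checked numerically. The following is a minimal sketch and not part of the original analysis: the helper name `expected_block_energy`, the test vector and the number of trials are illustrative assumptions. It averages the squared norm of every block over many random permutations and compares the result with $\|v\|_2^2 / K$.

```python
import numpy as np

def expected_block_energy(v, K, trials=20000, seed=0):
    """Monte Carlo estimate of the expected squared norm of each of the
    K blocks of v after a uniformly random permutation of its entries."""
    rng = np.random.default_rng(seed)
    d = v.shape[0]
    assert d % K == 0, "d must be divisible by K"
    totals = np.zeros(K)
    for _ in range(trials):
        blocks = rng.permutation(v).reshape(K, d // K)
        totals += (blocks ** 2).sum(axis=1)
    return totals / trials

# A deliberately unbalanced vector: all of its energy sits in 4 coordinates.
v = np.zeros(64)
v[:4] = [8.0, 4.0, 2.0, 1.0]
K = 8

est = expected_block_energy(v, K)
target = np.dot(v, v) / K   # (1/K) * ||v||_2^2, the right-hand side of (6)
print(est, target)
```

Note that the statement only holds in expectation: a single permutation of this vector can still place most of the energy in one block.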

Another possibility is to use a random rotation, which can be performed for instance by applying a random normalized Hadamard matrix $\mathcal{H}_n$. A Hadamard matrix is a matrix with entries taken from the set $\{-1, 1\}$ whose rows form an orthogonal system. A random normalized Hadamard matrix can be obtained from it by first multiplying by a random diagonal matrix $\mathcal{D}$ (where the entries on the diagonal are taken uniformly and independently from the set $\{-1, 1\}$) and then rescaling by the factor $\frac{1}{\sqrt{d}}$, where $d$ is the dimensionality of the data. Since the dot product is invariant with respect to permutations and rotations, we end up with an equivalent problem. 
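
As an illustration, the construction can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes $d$ is a power of two, builds the Hadamard matrix with the Sylvester construction, and the function names are hypothetical. It checks that the resulting matrix is orthogonal, that dot products are preserved, and that a maximally unbalanced vector (a single spike) is spread evenly over the blocks.

```python
import numpy as np

def sylvester_hadamard(d):
    """Return a d x d Hadamard matrix (Sylvester construction, d a power of two)."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def random_normalized_hadamard(d, rng):
    """Hadamard matrix times a random +/-1 diagonal matrix, rescaled by 1/sqrt(d)."""
    signs = rng.choice([-1.0, 1.0], size=d)
    # Broadcasting multiplies column i by signs[i], i.e. computes H @ diag(signs).
    return sylvester_hadamard(d) * signs / np.sqrt(d)

rng = np.random.default_rng(0)
d, K = 64, 8
R = random_normalized_hadamard(d, rng)

q, x = rng.normal(size=d), rng.normal(size=d)
dot_before, dot_after = q @ x, (R @ q) @ (R @ x)

# A spike has all of its energy in one block before the rotation.
spike = np.zeros(d)
spike[0] = 1.0
block_energy = ((R @ spike) ** 2).reshape(K, d // K).sum(axis=1)
print(dot_before, dot_after, block_energy)
```

In practice the product with a Hadamard matrix would be computed with a fast transform rather than a dense matrix; the dense form is used here only to keep the sketch short.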

If we take the random rotation approach then we have the following: 

Theorem 8.1. Let $v$ be a vector of dimensionality $d$ and let $0 < \eta < 1$. Then after applying to $v$ the linear transformation $\mathcal{H}_n$, the transformed vector is $\eta$-balanced with probability at least $1 - 2d\,e^{-\ldots}$, where $K$ is the number of blocks. 

Proof. We start with the following Azuma’s concentration inequality, which we will also use later: 

Lemma 8.1. Let $X_1, X_2, \ldots$ be random variables such that $E[X_1] = 0$, $E[X_i \mid X_1, \ldots, X_{i-1}] = 0$ and $-\alpha_i \le X_i \le \beta_i$ for $i = 1, 2, \ldots$ and some $\alpha_1, \alpha_2, \ldots, \beta_1, \beta_2, \ldots > 0$. Then $\{X_1, X_2, \ldots\}$ is a martingale and the following holds for any $a > 0$: 

$$P\Big(\sum_{i=1}^{n} X_i \ge a\Big) \le \exp\Big(-\frac{2a^2}{\sum_{i=1}^{n}(\alpha_i + \beta_i)^2}\Big).$$

Let us denote $v = (v_1, \ldots, v_d)$. The $j$th entry of the transformed vector is of the form $h_{j,1}v_1 + \ldots + h_{j,d}v_d$, where $(h_{j,1}, \ldots, h_{j,d})$ is the $j$th row of $\mathcal{H}_n$, and thus each $h_{j,i}$ (for the fixed $j$) takes uniformly at random and independently a value from the set $\{-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\}$. 

Let us consider the random variable $Y_1 = \sum_{j=1}^{d/K} (h_{j,1}v_1 + \ldots + h_{j,d}v_d)^2$ that captures the squared $L_2$-norm of the first block of the transformed vector $v$. We have: 

$$E[Y_1] = \sum_{j=1}^{d/K}\Big(\sum_{i=1}^{d}\frac{v_i^2}{d} + 2\sum_{1 \le i_1 < i_2 \le d} v_{i_1}v_{i_2}E[h_{j,i_1}h_{j,i_2}]\Big) = \frac{\|v\|_2^2}{K}, \tag{7}$$

where the last equality comes from the fact that $E[h_{j,i_1}h_{j,i_2}] = 0$ for $i_1 \ne i_2$. Of course the same argument is valid for the other blocks, thus we can conclude that in expectation the transformed vector is 1-balanced. Let us now prove some concentration inequalities regarding this result. Let us fix some $j \in \{1, \ldots, d\}$. Denote $X_{i_1,i_2} = v_{i_1}v_{i_2}h_{j,i_1}h_{j,i_2}$. Let us find an upper bound on the probability $P(|\sum_{1 \le i_1 < i_2 \le d} X_{i_1,i_2}| > a)$ for some fixed $a > 0$. We have already noted that $E[X_{i_1,i_2}] = 0$. 

Thus, by applying Lemma 8.1 we get the following: 

$$P\Big(\Big|\sum_{1 \le i_1 < i_2 \le d} X_{i_1,i_2}\Big| > a\Big) \le \ldots \tag{8}$$

Therefore, by the union bound, $P(|Y_1 - \frac{\|v\|_2^2}{K}| > \ldots) \le \ldots$. Let us fix $\eta > 0$. Thus by taking $a = \ldots$ and again applying the union bound (over all the blocks) we conclude that the transformed vector $v$ is not $\eta$-balanced with probability at most $2d\,e^{-\ldots}$. That completes the proof. 


□ 


The calculated upper bound on the probability of failure from Theorem 8.1, as a function of the number of blocks $K$, is presented in Fig. 6. We clearly see that the failure probability decreases exponentially with the number of blocks $K$. 
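
The concentration tool used in the proof above can be illustrated with a small simulation. The sketch below assumes the standard one-sided Azuma-Hoeffding form $P(\sum_i X_i \ge a) \le \exp(-2a^2 / \sum_i (\alpha_i + \beta_i)^2)$ for Lemma 8.1 and uses independent $\pm 1$ steps, for which $\alpha_i = \beta_i = 1$. The function name `azuma_bound` and all numerical settings are illustrative assumptions.

```python
import numpy as np

def azuma_bound(a, alphas, betas):
    """One-sided Azuma-Hoeffding bound for sums with -alpha_i <= X_i <= beta_i."""
    return float(np.exp(-2.0 * a ** 2 / np.sum((alphas + betas) ** 2)))

rng = np.random.default_rng(0)
n, trials, a = 100, 20000, 20.0

# Independent zero-mean +/-1 steps form a martingale difference sequence.
steps = rng.choice([-1.0, 1.0], size=(trials, n))
empirical = float(np.mean(steps.sum(axis=1) >= a))
bound = azuma_bound(a, np.ones(n), np.ones(n))

print(f"empirical tail = {empirical:.4f}, Azuma bound = {bound:.4f}")
```

The bound is not tight for this distribution, which is expected: it only uses the ranges of the increments, not their exact law.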


8.2.2 Proof of Theorem 4.2 

If some boundedness and balancedness conditions regarding datapoints can be assumed, we can obtain exponentially strong concentration results regarding the unbiased estimator considered in the paper. Next we show some results that can be obtained even if the boundedness and balancedness conditions do not hold. Below we present the proof of Theorem 4.2. 

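Proof. Let us define $Z = \sum_{k=1}^{K} Z^{(k)}$, where $Z^{(k)} = q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)}$. We have: 

$$\begin{aligned}
P(\mathcal{F}(a, \epsilon)) &= P\big((q^Tx > a) \wedge ((q^Tu_x > q^Tx(1 + \epsilon)) \vee (q^Tu_x < q^Tx(1 - \epsilon)))\big) \\
&\le P(|q^Tx - q^Tu_x| > a\epsilon) \\
&= P\Big(\Big|\sum_{k=1}^{K} (q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)})\Big| > a\epsilon\Big) \\
&= P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big).
\end{aligned} \tag{9}$$

Note that from Eq. (9) we get: 

$$P(\mathcal{F}(a, \epsilon)) \le P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big). \tag{10}$$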

Let us now fix the $k$th block ($k = 1, \ldots, K$). From the $\eta$-balancedness we get that every datapoint truncated to its $k$th block is within distance $\gamma = \sqrt{\frac{1}{K} + (1 - \eta)}\, r$ to $z^{(k)}$ (i.e. $z$ truncated to its $k$th block). Now consider in the linear space related to the $k$th block the ball $B' = B(z^{(k)}, \gamma)$. Note that since the dimensionality of each datapoint truncated to the $k$th block is $\frac{d}{K}$, we can conclude that all datapoints truncated to their $k$th blocks that reside in $B'$ can be covered by $C$ balls of radius $r'$ each, where $(\frac{\gamma}{r'})^{d/K} = C$. We take as the set of quantizers $u_1^{(k)}, \ldots, u_C^{(k)}$ for the $k$th block the centers of mass of the sets consisting of points from these balls. We will show now that the sets $\{u_1^{(k)}, \ldots, u_C^{(k)}\}$ ($k = 1, \ldots, K$) defined in such a way are the codebooks we are looking for. 

Figure 7: Upper bound on the probability of the event $\mathcal{F}(a, \epsilon)$ as a function of the number of subspaces $K$ for $\epsilon = 0.2$. The left figure corresponds to $\eta = 0.75$ and the right one to $\eta = 0.5$. Different curves correspond to different data dimensionality ($d = 128, 256, 512, 1024$). We assume that the entire data is in the unit-ball and the norm of $q$ is uniformly split across all $K$ chunks. 

From the triangle inequality and the Cauchy-Schwarz inequality, we get: 

$$|Z^{(k)}| \le \Big(\max_{q \in Q}\|q^{(k)}\|_2\Big)\Big(\max_{x \in X}\|x^{(k)} - u_x^{(k)}\|_2\Big) \le 2q_{\max}r' = 2q_{\max}\gamma C^{-\frac{K}{d}}. \tag{11}$$

This comes straightforwardly from the way we defined the sets $\{u_1^{(k)}, \ldots, u_C^{(k)}\}$ for $k = 1, \ldots, K$. 

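Let us take $X_i = Z^{(i)}$. Thus, from (11), we see that $\{X_1, \ldots, X_K\}$ defined in such a way satisfies the assumptions of Lemma 8.1 for $c_k = 2q_{\max}\gamma C^{-\frac{K}{d}}$ (that is, $-c_k \le X_k \le c_k$). 

Therefore, from Lemma 8.1 we get: 

$$P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big) \le 2e^{-\left(\frac{a\epsilon}{r}\right)^2 \frac{C^{2K/d}}{8q_{\max}^2(1 + (1 - \eta)K)}} \tag{12}$$

and that, by (10), completes the proof. 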


The dependence of the probability of the failure event $\mathcal{F}(a, \epsilon)$ from Theorem 4.2 on the number of subspaces $K$ is presented in Fig. 7. 

□ 
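
The decomposition of the error into per-subspace terms $Z^{(k)}$ and the Cauchy-Schwarz step used in the proof above can be sketched numerically. This is an illustration under simplifying assumptions, not the codebook construction of the proof or the learning procedure of the paper: the codebook of each subspace is a hypothetical stand-in made of $C$ randomly chosen datapoints, and each datapoint is assigned to its nearest codeword in Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, C, n = 32, 4, 16, 500          # dimension, subspaces, codewords, datapoints
sub = d // K
X = rng.normal(size=(n, d))          # synthetic database
q = rng.normal(size=d)               # synthetic query

approx = np.zeros(n)                 # sum over k of q^(k)T u_x^(k)
bound = np.zeros(n)                  # sum over k of ||q^(k)|| * ||x^(k) - u_x^(k)||
for k in range(K):
    block = slice(k * sub, (k + 1) * sub)
    Xk, qk = X[:, block], q[block]
    # Stand-in codebook: C datapoints picked at random (an assumption).
    codebook = Xk[rng.choice(n, size=C, replace=False)]
    dists = np.linalg.norm(Xk[:, None, :] - codebook[None, :, :], axis=2)
    u = codebook[dists.argmin(axis=1)]           # quantizer u_x^(k) of each x
    approx += u @ qk
    bound += np.linalg.norm(qk) * np.linalg.norm(Xk - u, axis=1)

exact = X @ q
Z = exact - approx                   # Z = sum over k of Z^(k)
print(np.abs(Z).max(), bound.max())
```

The inner product is approximated as a sum of $K$ subspace inner products, and the absolute error of every datapoint stays below the sum of the per-subspace Cauchy-Schwarz bounds.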


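The following result is of independent interest since it does not assume anything about balancedness or boundedness. It shows that minimizing the objective function $L = \sum_{k=1}^{K} L^{(k)}$, where $L^{(k)} = \sum_{x \in X} E_{q \sim Q}[(q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)})^2]$, leads to concentration results regarding the error made by the algorithm. 

Theorem 8.2. The following is true: 

$$P(\mathcal{F}(a, \epsilon)) \le \frac{K^3 \max_{k \in \{1, \ldots, K\}} L^{(k)}}{|X|a^2\epsilon^2}.$$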
Proof. Fix some $k \in \{1, \ldots, K\}$. Let us consider first the expression $L^{(k)} = \sum_{x \in X} E_{q \sim Q}[(q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)})^2]$ that our algorithm aims to minimize. We will show that it is a rescaled version of the variance of the random variable $Z^{(k)}$. 

We have: 

$$Var(Z^{(k)}) = E[(Z^{(k)})^2] - (E[Z^{(k)}])^2 = E[(Z^{(k)})^2], \tag{13}$$

where the last equality comes from the unbiasedness of the estimator (Lemma 3.1). 
Thus we obtain: 

$$Var(Z^{(k)}) = \frac{L^{(k)}}{|X|}. \tag{14}$$

Therefore, by minimizing $L^{(k)}$ we minimize the variance of the random variable that measures the discrepancy between the exact answer and the quantized answer to the dot product query for the space truncated to the fixed $k$th block. Denote $u_x = (u_x^{(1)}, \ldots, u_x^{(K)})$. We are ready to give an upper bound on $P(\mathcal{F}(a, \epsilon))$. 

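We have: 

$$\begin{aligned}
P(\mathcal{F}(a, \epsilon)) &\le P(|q^Tx - q^Tu_x| > a\epsilon) = P\Big(\Big|\sum_{k=1}^{K} (q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)})\Big| > a\epsilon\Big) \\
&= P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big) \\
&\le \frac{K^3 \max_{k \in \{1, \ldots, K\}} Var(q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)})}{a^2\epsilon^2} \\
&= \frac{K^3 \max_{k \in \{1, \ldots, K\}} Var(Z^{(k)})}{a^2\epsilon^2}.
\end{aligned} \tag{15}$$

The last inequality comes from Markov’s inequality applied to the random variable $(Z^{(k)})^2$ and the union bound. 

Thus, by applying the obtained bound on $Var(Z^{(k)})$, we complete the proof. 

□ 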


8.2.3 Independent blocks - the proof of Theorem 4.3 

Let us assume that different blocks correspond to independent sets of dimensions. Such an assumption is often reasonable in practice. If this is the case, we can strengthen our methods for obtaining tight concentration inequalities. The proof of Theorem 4.3 that covers this scenario is given below. 


Proof. Let us assume first the most general case, when no balancedness is assumed. We begin the proof in the same way as we did in the previous section, i.e. fix some $k \in \{1, \ldots, K\}$ and consider the random variable $Z^{(k)}$. The goal is again to first find an upper bound on $Var(Z^{(k)})$. From the proof of Theorem 8.2 we get: $Var(Z^{(k)}) = \frac{L^{(k)}}{|X|}$. Then again, following the proof of Theorem 8.2 we have: 

$$P(\mathcal{F}(a, \epsilon)) \le P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big). \tag{16}$$

We will again bound the expression $P(|\sum_{k=1}^{K} Z^{(k)}| > a\epsilon)$. We will use the following version of the Berry-Esseen inequality ([11]): 

Theorem 8.3. Let $\{S_1, \ldots, S_n\}$ be a sequence of independent random variables with mean 0, not necessarily identically distributed, with finite third moment each. Assume that $\sum_{i=1}^{n} E[S_i^2] = 1$. Define $W_n = \sum_{i=1}^{n} S_i$. Then the following holds: 

$$|P(W_n \le x) - \phi(x)| \le \frac{C}{1 + |x|^3}\sum_{i=1}^{n} E[|S_i|^3]$$

for every $x$ and some universal constant $C > 0$, where $\phi(x) = P(g \le x)$ and $g \sim \mathcal{N}(0, 1)$. 
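
The normal approximation stated in Theorem 8.3 can be illustrated numerically. The sketch below is an illustration under stated assumptions: it uses independent, non-identically distributed variables $S_i = s_i \cdot (\pm 1)$ with weights scaled so that $\sum_i E[S_i^2] = 1$, and it compares the empirical distribution function of $W_n$ with the standard normal one on a grid. All names and settings are illustrative, and the value of the universal constant $C$ is not estimated.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 40000

# Non-identical scales, normalized so that sum_i E[S_i^2] = sum_i s_i^2 = 1.
a = np.linspace(0.5, 1.5, n)
s = a / np.sqrt(np.sum(a ** 2))

signs = rng.choice([-1.0, 1.0], size=(trials, n))
W = signs @ s                                   # samples of W_n = sum_i S_i
third_moment_sum = float(np.sum(s ** 3))        # sum_i E[|S_i|^3]

grid = np.linspace(-3.0, 3.0, 61)
phi = np.array([0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in grid])
emp = np.array([(W <= x).mean() for x in grid])
max_dev = float(np.max(np.abs(emp - phi)))

print(f"max |P(W_n <= x) - phi(x)| on grid = {max_dev:.4f}")
print(f"sum of third absolute moments      = {third_moment_sum:.4f}")
```

The observed deviation is well below the sum of third absolute moments, in line with the right-hand side of the theorem, which shrinks further for large $|x|$.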
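Note that if dimensions corresponding to different blocks are independent, then $\{Z^{(1)}, \ldots, Z^{(K)}\}$ is a family of independent random variables. This is the case, since every $Z^{(k)}$ is defined as $Z^{(k)} = q^{(k)T}x^{(k)} - q^{(k)T}u_x^{(k)}$. Note that we have already noticed that the following holds: $E[Z^{(k)}] = 0$. Let us take $S^{(k)} = \frac{Z^{(k)}}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}}$. Clearly, we have $\sum_{k=1}^{K} E[(S^{(k)})^2] = 1$. Besides, the random variables $S^{(k)}$ defined in this way are independent and $E[S^{(k)}] = 0$ for $k = 1, \ldots, K$. Denote: 

$$F = \sum_{k=1}^{K} E[|S^{(k)}|^3] = \frac{\sum_{k=1}^{K} E[|Z^{(k)}|^3]}{\big(\sum_{k=1}^{K} Var(Z^{(k)})\big)^{3/2}}.$$

Thus, from Theorem 8.3 we get: 

$$\Big|P\Big(\frac{\sum_{k=1}^{K} Z^{(k)}}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}} \le x\Big) - \phi(x)\Big| \le \frac{C}{1 + |x|^3}F. \tag{17}$$

Therefore, for every $c > 0$ we have: 

$$P\Big(\frac{|\sum_{k=1}^{K} Z^{(k)}|}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}} > c\Big) = 1 - P\Big(\frac{\sum_{k=1}^{K} Z^{(k)}}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}} \le c\Big) + P\Big(\frac{\sum_{k=1}^{K} Z^{(k)}}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}} < -c\Big). \tag{18}$$

Denote $\bar{\phi}(x) = 1 - \phi(x)$. Thus, we have: 

$$1 - P\Big(\frac{\sum_{k=1}^{K} Z^{(k)}}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}} \le c\Big) + P\Big(\frac{\sum_{k=1}^{K} Z^{(k)}}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}} < -c\Big) \le 1 - \phi(c) + \phi(-c) + \frac{2C}{1 + c^3}F. \tag{19}$$

$$\begin{aligned}
P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > c\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}\Big) &\le 1 - \phi(c) + \phi(-c) + \frac{2C}{1 + c^3}F \\
&= 2\bar{\phi}(c) + \frac{2C}{1 + c^3}F \\
&\le \frac{2}{\sqrt{2\pi}\,c}e^{-\frac{c^2}{2}} + \frac{2C}{1 + c^3}F,
\end{aligned} \tag{20}$$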


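where in the last inequality we used the well-known fact that $\bar{\phi}(x) \le \frac{1}{\sqrt{2\pi}\,x}e^{-\frac{x^2}{2}}$. 

If we now take $c = \frac{a\epsilon}{\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}}$, then by applying (20) together with $1 + c^3 \ge c^3$, we get: 

$$P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big) \le \frac{2\sqrt{\sum_{k=1}^{K} Var(Z^{(k)})}}{\sqrt{2\pi}\,a\epsilon}e^{-\frac{a^2\epsilon^2}{2\sum_{k=1}^{K} Var(Z^{(k)})}} + \frac{2C}{a^3\epsilon^3}\sum_{k=1}^{K} E[|Z^{(k)}|^3]. \tag{21}$$

Substituting the exact expression for $Var(Z^{(k)})$, we get: 

$$P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big) \le \frac{2\sqrt{\sum_{k=1}^{K} L^{(k)}}}{\sqrt{2\pi|X|}\,a\epsilon}e^{-\frac{a^2\epsilon^2|X|}{2\sum_{k=1}^{K} L^{(k)}}} + \frac{2C}{a^3\epsilon^3}\sum_{k=1}^{K} E[|Z^{(k)}|^3]. \tag{22}$$

Note that $|Z^{(k)}| = |q^{(k)T}(x^{(k)} - u_x^{(k)})| \le \|q^{(k)}\|_2\|x^{(k)} - u_x^{(k)}\|_2$. The latter expression is at most $q_{\max}\Delta$, by the definition of $\Delta$ and $q_{\max}$. Thus we get $|Z^{(k)}|^3 \le q_{\max}^3\Delta^3 \le a$, where the last inequality follows from the assumptions on $\Delta$ from the statement of the theorem. Therefore, from (22) we get: 

$$P\Big(\Big|\sum_{k=1}^{K} Z^{(k)}\Big| > a\epsilon\Big) \le \frac{2\sqrt{\sum_{k=1}^{K} L^{(k)}}}{\sqrt{2\pi|X|}\,a\epsilon}e^{-\frac{a^2\epsilon^2|X|}{2\sum_{k=1}^{K} L^{(k)}}} + \frac{2CK}{a^2\epsilon^3}. \tag{23}$$

Thus, taking into account (16) and putting $\beta = 2C$, we complete the proof. 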


□ 
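
The Gaussian tail estimate used in the proof above, $1 - \phi(x) \le \frac{1}{\sqrt{2\pi}\,x}e^{-x^2/2}$ for $x > 0$, can be verified numerically. The following sketch uses only the Python standard library; the function names and the evaluation grid are illustrative assumptions.

```python
import math

def gaussian_tail(x):
    """Exact upper tail 1 - phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_bound(x):
    """Upper bound exp(-x^2 / 2) / (sqrt(2 * pi) * x), valid for x > 0."""
    return math.exp(-x * x / 2.0) / (math.sqrt(2.0 * math.pi) * x)

xs = [0.1 * i for i in range(1, 61)]            # grid on (0, 6]
gaps = [tail_bound(x) - gaussian_tail(x) for x in xs]

for x in (0.5, 1.0, 2.0, 4.0):
    print(f"x = {x}: tail = {gaussian_tail(x):.3e}, bound = {tail_bound(x):.3e}")
```

The bound is loose near zero, where it diverges, and becomes sharp as $x$ grows, which is the regime of interest when $c$ is large.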
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